Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] fix issue with identifiers in the LCA code #1347

Merged
merged 9 commits into from
Feb 26, 2021
Merged

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Feb 24, 2021

This fixes some problems with identifiers due to the changes to SourmashSignature.name in #1179.

This should probably go into 4.0.0...

Brief writeup of my understanding of the issues so far:

  • sig.name can be empty if no name is given on the signature, but this then causes problems in both sourmash lca index and LCA_Database.insert(...)
  • this also revealed a problem in the reporting on identifiers in sourmash lca index - if the identifiers are identical (in this case '', but in whatever case) then the number of leftover signatures gets mis-reported.

A particularly simple situation where this arises is in the test test_index_empty_sketch_name, where we:

  1. sketch some DNA sequences using sourmash sketch dna, without providing a name;
  2. try to index them with lca index;
  3. sig.name is now empty, which triggers a problem in LCA_Database.insert(...) with duplicate identifiers;
  4. the number of unused identifiers tracked in record_no_lineage is now miscounted as 1 instead of 2, because we used a set.

This PR fixes the following issues:

  • sourmash sketch and sourmash compute no longer create empty signatures from empty files and stdin;
  • sourmash sketch and sourmash compute set sig.filename to empty string when filename is -;
  • in sourmash lca index, record_no_lineage is now a list instead of set;
  • in sourmash lca index, ident is taken from sig.filename if sig.name is empty.
  • in LCA_Database, ident defaults to str(sig)

TODO:

Checklist

  • Is it mergeable?
  • make test Did it pass the tests?
  • make coverage Is the new code covered?
  • Did it change the command-line interface? Only additions are allowed
    without a major version increment. Changing file formats also requires a
    major version number increment.
  • Was a spellchecker run on the source code and documentation after
    changes were made?

@ctb ctb added the 4.0 issues to address for a 4.0 release label Feb 24, 2021
@codecov
Copy link

codecov bot commented Feb 24, 2021

Codecov Report

Merging #1347 (5e73f0a) into latest (6623151) will increase coverage by 5.36%.
The diff coverage is 97.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           latest    #1347      +/-   ##
==========================================
+ Coverage   88.74%   94.11%   +5.36%     
==========================================
  Files         123       96      -27     
  Lines       18124    14593    -3531     
  Branches     1399     1405       +6     
==========================================
- Hits        16084    13734    -2350     
+ Misses       1801      621    -1180     
+ Partials      239      238       -1     
Flag Coverage Δ
python 94.11% <97.11%> (+0.04%) ⬆️
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/lca/lca_db.py 92.40% <66.66%> (+0.55%) ⬆️
src/sourmash/command_compute.py 94.96% <91.66%> (-0.57%) ⬇️
src/sourmash/lca/command_index.py 91.53% <100.00%> (+1.16%) ⬆️
src/sourmash/sig/__main__.py 90.79% <100.00%> (ø)
tests/test_cmd_signature.py 100.00% <100.00%> (ø)
tests/test_lca.py 99.87% <100.00%> (+<0.01%) ⬆️
tests/test_sourmash_sketch.py 100.00% <100.00%> (ø)
src/core/src/lib.rs
src/core/src/ffi/mod.rs
src/core/src/errors.rs
... and 25 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6623151...5e73f0a. Read the comment docs.

@ctb ctb changed the title [WIP] fix issue with identifiers in the LCA code [MRG] fix issue with identifiers in the LCA code Feb 25, 2021
@ctb
Copy link
Contributor Author

ctb commented Feb 25, 2021

I think this is ready, @luizirber and @bluegenes. And I think it should go into 4.0 release, because it is a problem that is triggered by #1179, which is one of the "breaking backwards compatibility" PRs going into 4.0.

@ctb ctb mentioned this pull request Feb 25, 2021
3 tasks
tests/test_lca.py Outdated Show resolved Hide resolved
@ctb
Copy link
Contributor Author

ctb commented Feb 26, 2021

Fixed! thanks @bluegenes!

@ctb ctb merged commit 47a294b into latest Feb 26, 2021
@ctb ctb deleted the fix_lca_index_ident branch February 26, 2021 03:08
@ctb
Copy link
Contributor Author

ctb commented Feb 26, 2021

thanks!

@ctb ctb mentioned this pull request Mar 2, 2021
ctb added a commit that referenced this pull request Mar 2, 2021
ctb added a commit that referenced this pull request Mar 2, 2021
* update release docs for latest build config

* update release notes to include #1347/lca fixes

* add note about database compatibility

* add numerical/results stability

* update docs to include example of sourmash sketch stdin

* update command line example with better clustering output
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4.0 issues to address for a 4.0 release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants